Note: An implementation of the theory in this notebook can be found in this notebook. The sections closely mirror each other.
The entropy of a random variable is a measure of the uncertainty in its value.
Consider a discrete random variable $X$ with underlying sample space $S_X$ and pmf $p[x] = P[X=x]$. If $p[x_i] = 1$ for any $x_i$ then clearly there is no uncertainty in this random variable.
If, on the other hand, $p[x_i] = \frac{1}{N} \forall \space x_i \in \{x_i : X(s_i) = x_i ; s_i \in S_X\}$ i.e. every single possible outcome of the random variable has an equal probability; there is a much greater uncertainty in each possible outcome. In the case where $x_i \in \{a , b\}$ this would correspond to $P[X=a] = P[X=b] = \frac{1}{2}$. An appropriate measure for this uncertainty is found to be
$$ I(x) = \log_2{\frac{1}{p[x]}} \tag{1} $$This measure satisfies our discussion above as when we have a high probability of an outcome, the uncertainty measure $I$ is low (with a minimum at 0) and when we have a low probability, the uncertainty can be arbitrarily large.
Considered across all possible values a random variable may take, we get the expected value of the uncertainty:
$$ \begin{split} H(p) &= \mathbb{E}[I(x)] \\ &= \sum_{x\in X}p[x]\log_2{\frac{1}{p[x]}} \end{split}\tag{2} $$We call this measure $H(p)$ the entropy for the pmf p.
For more details see: Leon-Garcia (2008, Chapter 4.10.1)
Entropy is represented in units of bits (when we use $\log_2$ for the uncertainty measure) and can be seen as a measure of information as well. Since, from above, we know that with a larger number of events the uncertainty increases, it makes sense that the greater the entropy, the more information is required to represent the average event. If we assign codewords in order to represent each possible outcome, the following results are found:
Returning to the equiprobable 2 outcome case from earlier, we find that the entropy is: $\log_2{2} \implies 1\text{bit}$. Indeed since the pmf consists of powers of $\frac{1}{2}$ the code we find the code that achieves the entropy is $0$ represents outcome a and $1$ represents outcome b (or vice-versa).
For more details see: Leon-Garcia (2008, Chapter 4.10.2)
We see above that the entropy is the lowest possible average codeword length for any code. This is because in the calculation of the entropy, we take into account the correct probability distribution in the determination of the uncertainty of an outcome.
For example in our equiprobable two outcome case, we use the fact that the two outcomes have probabilities of $\frac{1}{2}$ each in the denominator of the log (this is different from the probability distribution across which the expected value operator is evaluated) which gives us the true value of the entropy. If instead we used a different probability distribution, we would end up with a different entropy value:
$$ H(p, q) = \sum_{x\in \mathcal{X}}p[x]\log_2{\frac{1}{q[x]}}\tag{3} $$where $p[x] = P[X_1 = x], \space q[x] = P[X_2 = x]$, $X_1, X_2$ are random variables and $\mathcal{X}$ is the support for both $p$ and $q$ i.e. every outcome assigned a probability by one pmf, has an assigned probability in the other pmf. We define this to be the cross-entropy.
This cross-entropy value can be seen as the information required to represent the random variable when we use a different probability distribution to determine the uncertainty.
Returning to the equiprobable two outcome case, consider an alternate pmf $q$ s.t. $q[a] = \frac{1}{3}$ and $q[b] = \frac{2}{3}$. $$ \begin{split} H(p,q) &= \frac{1}{2}(\log_2{3}+\log_2{\frac{3}{2}})\\ &= 1.1 \text{bits} \end{split} $$
In fact, cross-entropy will always be greater than or equal to the true entropy because we are using the 'wrong' uncertainty values in determining the minimum number of bits needed to represent the information with equality achieved only when the two distributions are the same.
Since we know that the cross-entropy will always be greater than the true entropy, we can find the additional bits required when we use this different probability distribution in the estimation of our uncertainties over the true minimum we get from the entropy. This measure is known as the relative entropy (also the Kullback-Leibler divergence $D_{\mathit{KL}}$) as introduced by Solomon Kullback and Richard Leibler in 1951 (Kullback and Leibler, 1951).
$$ \begin{split} D_{\mathit{KL}}(P \| Q) &= H(p,q) - H(p)\\ &= \sum_{x\in \mathcal{X}}p[x]\log_2{\frac{p[x]}{q[x]}} \end{split}\tag{4} $$This measure can also be seen as the difference between two probability distributions - a probability distribution $q$ from some reference probability distribution $p$. Clearly in the example considered above, this would be 0.1 bits i.e. the additional bits required when we use the wrong pmf to esimate uncertainty of an outcome.
In machine learning tasks the goal is often to learn a model to represent some true underlying probability distribution or function which the data follows. In order to do so, we must
The maximum likelihood method is a procedure for Step 2.
Given a class of models, there are likely to be some parameters of this class which, when correctly tuned, yield the true distribution. In order to determine these parameters, we introduce the likelihood function.
If our task is to represent some true underlying distribution for observed data points $\mathbf{x} = \begin{bmatrix} x_1, x_2, \dots, x_n \end{bmatrix}$ where the parameters of the underlying distribution are $\theta$ then the likelhihood function is defined:
$$ l(x_1, x_2, \dots, x_n ; \theta) = p_x(x_1, x_2, \dots, x_n | \theta)\tag{5} $$where $p_x$ is the probability that we observe these points given some value of $\theta$.
If, instead, our task is to represent some true function, i.e. $y = f(\mathbf{x})$ where we have knowledge of some $\mathbf{x} = \begin{bmatrix} x_1, x_2, \dots, x_n \end{bmatrix} \text{and} f(\mathbf{x}) = y$, and the parameters of the function $f$ are $\theta$ then the likelihood function may be defined as:
$$ \begin{split} l(y;\mathbf{x}; \theta) &= l(y;x_1, x_2, \dots, x_n; \theta)\\ &= p_f(y|x_1, x_2, \dots, x_n; \theta) \end{split}\tag{6} $$i.e. the likelihood is the probability that we observe the known outcome on the given input for a specific choice of the parameters.
The task is then to determine the value of $\theta$ for which either of these likelihood functions is maximised i.e. $$ \underset{\theta}{\text{argmax}}\space l(\mathbf x; \theta)\tag{7} $$ In this discussion, will consider only the first case, where we are considering some true underlying distribution from which data is generated but the subsequent steps are much the same in the latter case as well.
Assuming that each data point $x_1, x_2, \dots, x_n$ in $\mathbf{x}$ that we observe is independently drawn from the given distribution, the likelihood function becomes:
$$ \begin{split} l(\mathbf{x}; \theta) &= p_x(x_1, x_2, \dots, x_n | \theta)\\ &= p_x(x_1 | \theta)\times p_x(x_2 | \theta)\times \dots\times p_x(x_n | \theta)\\ &= \prod_{i=1}^np_x(x_i|\theta) \end{split}\tag{8} $$And our task becomes to find:
$$ \underset{\theta}{\text{argmax}} \prod_{i=1}^np_x(x_i|\theta)\tag{9} $$Since the log function is monotonically increasing, we note that maximising $\log{l(\mathbf{x}; \theta)}$ is equivalent to maximising this above likelihood function. Thus, the product we derived above becomes:
$$ \begin{split} \underset{\theta}{\text{argmax}}\space l(\mathbf{x};\theta) &= \underset{\theta}{\text{argmax}} \log{l(\mathbf{x}; \theta)}\\ &= \underset{\theta}{\text{argmax}} \log{\prod_{i=1}^np_x(x_i|\theta)}\\ &= \underset{\theta}{\text{argmax}} \sum_{i=1}^n \log{p_x(x_i|\theta)} \end{split}\tag{10} $$Just as the argmax operator produced the same result on the application of the log function, because the latter is monotonically increasing, it similarly produces the same result on scaling so:
$$ \begin{split} \underset{\theta}{\text{argmax}}\space l(\mathbf{x};\theta) &= \underset{\theta}{\text{argmax}}\frac{1}{n}\sum_{i=1}^n\log{p_x(x_i|\theta)}\\ &= \underset{\theta}{\text{argmax}}\space\mathbb{E}_{x\sim{p_{data}}}\log{p_x(x|\theta)} \end{split}\tag{11} $$where the expectation is over the emperical distribution i.e. the distribution that we have observed and the $x\sim p_{data}$ indicates that the sequence $x_n$ is the emperical distribution $p_{data}$. The reason we find that this is an expectation is because each instance in the emperical distribution occurs once and thus with probability $\frac{1}{n}$.
But this expectation is the negative cross entropy of the model distribution with respect to the emperical distribution i.e.
$$ \begin{split} \mathbb{E}_{x\sim{p_{data}}}\log{p_x(x|\theta)} &= -H(p_{data}, p_{x})\\ \implies \underset{\theta}{\text{argmax}}\space \mathbb{E}_{x\sim{p_{data}}}\log{p_x(x|\theta)} &= \underset{\theta}{\text{argmin}}H(p_{data}, p_{x})\\ \end{split}\tag{12} $$Thus, the method of maximum likelihood translates equivalently to minimising the cross entropy of the true distribution with respect to the observed distribution. Further relating this to the Kullback-Leibler Divergence, we find that minimising the cross entropy, by changing the parameter $\theta$ is equivalent to minimising the Kullback-Leibler Divergence
$$ \underset{\theta}{\text{argmin}}H(p_{data}, p_{x}) = \underset{\theta}{\text{argmin}}\Big\{H(p_{data}, p_{x}) - H(p_{data})\Big\}\tag{13} $$since $H(p_{data})$ does not change with varying $\theta$.
Thus we have found that we have found that the method of maximum likelihood is actually trying to minimise the KL Divergence between the emperical distribution that we can observe and the true distribution. This fits in nicely with the understanding derived earlier of $D_{\mathit{KL}}$ being the difference between two probability distributions.
For more details see: Goodfellow et al. (2016, Chapter 5)
In order to motivate the use of the method of maximum likelihood, it is important to know that it will yield the true parameter of the underlying distribution or at least a 'good' estimator. We have some evidence in the above relation to the KL Divergence. The Cramer-Rao Inequality gives us a much stronger evidence.
What is a 'good' estimator? An unbiased, consistent estimator with low variance could be termed good. But there is nothing in our previous discussion that implies that our estimator will be any of this. The bias will be determined by the class of the model from which we make our choice of model, the consistency is determined by its behaviour as the number of samples increases and the variance by the range of values that our chosen model may assume. While already previously stated that we will not be dicsussing the choice of class of models here, the consistency is easy to determine from the behaviour of the model as samples increase. Thus we focus here on the variance.
The Cramer-Rao Inequality gives us a lower bound on the variance that an unbiased estimator of a model may have. Before we discuss it, however, we must introduce the score function and Fisher Information.
The score function is defined as the rate of change of the log-likelihood function with a change in its parameters $\theta$ i.e.
$$ s(\theta) := \frac{\partial \log{p_x(\mathbf{x}| \theta)}}{\partial \theta}\tag{14} $$The Fisher information for a parameter is defined as the expected value of the square of the score function i.e.
$$ \mathcal{I}(\theta) := \mathbb{E}\bigg[\bigg\{\frac{\partial\log{p_x(\mathbf{x}| \theta)}}{\partial \theta}\bigg\}^2\bigg]\tag{15} $$When the log-likelihood function changes rapidly with a change in $\theta$ we expect that we should be able to discern more easily the true parameter because changes in the log likelihood statistic will be visible on changing $\theta$. If it were to change slowly on varying $\theta$, then it would be tougher to discern models with different $\theta$ values. Hence in the former case - which corresponds to a high value of Fisher Information - we expect to be able to more easily determine the true value of $\theta$ (if at all) and in the latter - where the Fisher Information is low - we expect not to be able to precisely determine the true value of $\theta$.
For more details see: Leon-Garcia (2008, Chapter 8.3.1)
The Cramer-Rao Inequality states that under certain regularity conditions (given below), the minimum variance that an unbiased estimator $\hat \Theta(\mathbf{x})$ for the parameter $\theta$ is given by the inverse of the Fisher Information Criterion i.e.
$$ \text{Var}_\hat\Theta \geq \frac{1}{\mathcal{I}(\theta)}\tag{16} $$Continuing our discussion of Fisher Information from above, the higher the Fisher information, the lower the CR bound and hence the lower bound on the variance that an unbiased estimator must at least assume.
For more details see: Leon-Garcia (2008, Chapter 8.3.1)
The CR bound relies on the following two regularity assumptions:
The values of $x$ which satisfy $p_x(x|\theta)>0$ are independent of $\theta$ and $\frac{\partial \log{p_x(x|\theta)}}{\partial \theta}$ exists for all these values of $x$ and all values which $\theta$ may assume in selected class of models
$\frac{\partial}{\partial \theta} \Big[\int \hat \Theta(\mathbf{x}) p_x(x|\theta) dx\Big] = \int \hat \Theta(\mathbf{x})\frac{\partial}{\partial \theta}p(x|\theta)dx$
For more details see: Bickel and Doksum (2015, Chapter 3.4.2)
Bickel, P. J., & Doksum, K. A. (2015). Mathematical Statistics: Basic Ideas and Selected Topics, Volume I, Second Edition (Chapman & Hall/CRC Texts in Statistical Science Book 117): Vol. I (2nd ed.). Chapman and Hall/CRC.
Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning (Adaptive Computation and Machine Learning series) (Illustrated ed.). The MIT Press. https://www.deeplearningbook.org/
Kullback, S., & Leibler, R. A. (1951). On Information and Sufficiency. The Annals of Mathematical Statistics, 22(1), 79–86. https://doi.org/10.1214/aoms/1177729694
Leon-Garcia, A. (2008). Probability, Statistics, and Random Processes for Electrical Engineering (3rd ed.). Prentice Hall.